Battle royale games have surged in popularity in recent years. The premise of such games is as follows: players are dropped onto a fictional island and fight to be the last person standing. As they roam around the island, they loot for weapons and items crucial for their survival.
We are interested in building a prediction model for the popular battle royale game PUBG (PlayerUnknown’s Battlegrounds). In PUBG, players not only have to worry about getting killed by other players, but they also have to stay within the shrinking “safe zone,” which effectively forces players into contact with each other. Outside of the “safe zone,” players take damage to their health at increasing rates.
Through our analysis, we aim to understand which playing strategies are more successful than others: How aggressive are the playing styles of the winners? Is it better to land in a densely or sparsely populated area? Do players who travel farther on the map tend to place higher or lower? Answers to such questions should be of great interest to the PUBG gaming community.
First, we want to investigate how well we can predict a player’s placement based on their in-game actions. Which actions or statistics are most predictive of placement? Exploring this question can then provide insight into how different playing styles compare. We would like to build a model that not only predicts a player’s game performance accurately, but also allows us to draw inferences about whether certain playing styles are more successful.
The data comes from the Kaggle competition. To download the data, join the Kaggle competition and run the shell script download_data.sh.
Note: We will need to provide a direct download link for the TA.
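library(tidyverse)  # read_csv(), dplyr verbs, gather(), ggplot2
library(janitor)    # clean_names()
library(caret)      # createDataPartition()
data.url <- "https://www.dropbox.com/s/319vkfevkfb6kqt/all.zip?dl=1"
# Download and unzip only if the archive is not already present
if (!file.exists("./data/pubg.zip")) {
  dir.create("./data", showWarnings = FALSE)
  download.file(data.url, destfile = "./data/pubg.zip", mode = "wb")
  unzip("./data/pubg.zip", exdir = "./data/pubg")
}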
# Warning: Very large datasets. Read 10000 samples before scaling up.
raw_dat <- read_csv("data/pubg/train_V2.csv", n_max = 10000)
test_dat <- read_csv("data/pubg/test_V2.csv")
Each row in the data contains one player’s post-game stats. A description of all data fields is provided in pubg_codebook.csv. We will focus on the solo game mode (match_type is solo, solo-fpp, or normal-solo-fpp). The solo game mode constitutes about 15% of the data. The outcome variable we are trying to predict is win_place_perc.
# Clean column names
# Keep solo (single-player) matches only
# Remove features that are not relevant to solo play
# Convert id and match_id to factors
clean_dat <- raw_dat %>%
  clean_names() %>%
  filter(match_type %in% c("solo", "solo-fpp", "normal-solo-fpp")) %>%
  select(-dbn_os, -assists, -revives, -group_id, -match_type, -team_kills) %>%
  mutate(id = as.factor(id), match_id = as.factor(match_id))
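The share of solo matches quoted above can be checked with a one-line proportion. The following is a minimal base R sketch on a hypothetical toy data frame (`toy_dat`, `solo_types`, and `solo_share` are illustrative names, and the toy rows are made up); on the real data the same expression would be applied to the `match_type` column before it is dropped.

```r
# Hypothetical toy data standing in for the raw data after clean_names();
# the values are invented for illustration only.
toy_dat <- data.frame(
  match_type = c("solo", "duo", "squad-fpp", "solo-fpp", "squad",
                 "duo-fpp", "normal-solo-fpp", "squad", "duo", "squad-fpp")
)

# Match types treated as solo play in this report
solo_types <- c("solo", "solo-fpp", "normal-solo-fpp")

# Proportion of rows that belong to a solo game mode
solo_share <- mean(toy_dat$match_type %in% solo_types)

# Row counts per match type, most common first
type_counts <- sort(table(toy_dat$match_type), decreasing = TRUE)

print(solo_share)
print(type_counts)
```

Running the check on the full training file, rather than on the first 10000 rows, would give the more reliable estimate.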
We are given a training set and a test set. The outcome variable for the test set will not be released until the Kaggle competition ends on January 30, 2019. Therefore, for the purposes of this project, we use only the provided training set, from which we create our own training and test sets.
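# Split into train and test sets (80/20)
# createDataPartition() returns a one-column matrix when list = FALSE,
# so take the first column to get a plain index vector
train_ind <- createDataPartition(y = clean_dat$win_place_perc, p = 0.8, list = FALSE)
train <- clean_dat %>%
  slice(train_ind[, 1])
test <- clean_dat %>%
  slice(-train_ind[, 1])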
head(train)
# A tibble: 6 x 23
  id    match_id boosts damage_dealt headshot_kills heals kill_place
  <fct> <fct>     <int>        <dbl>          <int> <int>      <int>
1 315c… 6dc8ff8…      0       100                 0     0         45
2 311b… 2926117…      0         8.54              0     0         48
3 b780… 2c30ddf…      1       324.                1     5          5
4 9202… 07948d7…      3       254.                0    12         13
5 4714… bc2faec…      0       137.                0     0         37
6 0ba4… f7cb761…      0       194.                1     1         19
# ... with 16 more variables: kill_points <int>, kills <int>,
# kill_streaks <int>, longest_kill <dbl>, match_duration <int>,
# max_place <int>, num_groups <int>, rank_points <int>,
# ride_distance <dbl>, road_kills <int>, swim_distance <dbl>,
# vehicle_destroys <int>, walk_distance <dbl>, weapons_acquired <int>,
# win_points <int>, win_place_perc <dbl>
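An 80/20 split like the one above can be sanity-checked by confirming that the two sets partition the rows and that the outcome looks similar in both. The sketch below is only an illustration under stated assumptions: it uses simulated data and a plain random split with base R `sample()` rather than the stratified `createDataPartition()`, and all `toy_*` names and the seed are hypothetical.

```r
set.seed(1)  # arbitrary seed, chosen only to make this sketch reproducible

# Simulated stand-in for clean_dat: 1000 rows with an outcome in [0, 1]
toy_dat <- data.frame(win_place_perc = runif(1000))

# Simple random 80/20 split (not stratified, unlike createDataPartition())
n <- nrow(toy_dat)
toy_train_ind <- sample(n, size = floor(0.8 * n))
toy_train <- toy_dat[toy_train_ind, , drop = FALSE]
toy_test  <- toy_dat[-toy_train_ind, , drop = FALSE]

# The two sets should cover every row exactly once
print(c(train = nrow(toy_train), test = nrow(toy_test)))

# The outcome summaries should look broadly similar in both sets
print(summary(toy_train$win_place_perc))
print(summary(toy_test$win_place_perc))
```

Calling `set.seed()` before the real split would also make the train and test sets reproducible across runs.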
Plot of Distribution of Features by Finish Percentile
Interesting features:
* kill_place has a bimodal distribution. Players who finish around the 10th percentile include both those with high kills and those with low kill ranks. This may be reflected in kill_points: players with high kill_points may fall into the group with high kills per game (and therefore a high kill rank) but a low finish percentile.
* Some features look highly skewed (e.g. longest_kill, ride_distance, swim_distance). We may want to look at log-transformations of these features.
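One way to try the log-transformation idea is `log1p()`, which computes log(1 + x) and therefore stays defined when a feature is zero. This is a minimal sketch on simulated values, not on the PUBG data: the vector `toy_ride_distance`, its distribution, and the seed are assumptions made for illustration, including the assumption that a distance feature can contain many zeros.

```r
set.seed(1)  # arbitrary seed for a reproducible illustration

# Simulated right-skewed, non-negative feature with many zeros,
# loosely imitating a distance-type column such as ride_distance
toy_ride_distance <- c(rep(0, 50), rexp(50, rate = 1 / 2000))

# log1p(x) = log(1 + x) is defined at zero, unlike log(x)
toy_ride_distance_log <- log1p(toy_ride_distance)

# Compare the spread before and after the transformation
print(summary(toy_ride_distance))
print(summary(toy_ride_distance_log))
```

If the transformed feature still shows a large spike at zero, a separate indicator for "feature is zero" may be worth considering alongside the transformed value.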
Additional plots we might want:
train %>%
  mutate(win_place_cat = as.factor(floor(win_place_perc * 10) * 10)) %>%
  gather("feature", "value", -match_id, -match_duration,
         -id, -win_place_perc, -win_place_cat) %>%
  ggplot(aes(x = value, group = win_place_cat, color = win_place_cat)) +
  facet_wrap(feature ~ ., scales = "free") +
  geom_density() +
  labs(title = "Distribution of Features by Finish Percentile",
       x = "Value of Features", y = "Density", color = "Percentile") +
  theme_bw()